[AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget #495
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a5d3009268
Opus review of test factorization: Looking at the four tests side by side:

Tests 2, 3, and 4 share identical setup/teardown/assertion structure and differ only in two dimensions: dynamic vs. static loop, and fixed vs. varying coefficient. These three could collapse into a single parametrized test:

```python
@pytest.mark.parametrize("use_static_loop", [True, False])
@pytest.mark.parametrize("use_varying_coeff", [True, False])
@test_utils.test(require=qd.extension.adstack, ad_stack_experimental_enabled=True)
def test_adstack_sum_linear(use_static_loop, use_varying_coeff):
    x = qd.field(qd.f32)
    y = qd.field(qd.f32)
    qd.root.dense(qd.i, 1).place(x, x.grad)
    qd.root.place(y, y.grad)

    @qd.kernel
    def compute():
        for i in x:
            v = x[i]
            if qd.static(use_static_loop):
                r = qd.static(range(3))
            else:
                r = range(3)
            for a in r:
                if qd.static(use_varying_coeff):
                    y[None] += v * qd.cast(a + 1, qd.f32)
                else:
                    y[None] += v

    x[0] = 1.0
    y[None] = 0.0
    compute()
    y.grad[None] = 1.0
    x.grad[0] = 0.0
    compute.grad()
    expected = sum((a + 1) for a in range(3)) if use_varying_coeff else 3.0
    assert x.grad[0] == test_utils.approx(expected, rel=1e-4)
```

This gives you 4 combinations (including static + varying coefficient, which is new free coverage) in one test, and eliminates ~50 lines of duplication.
Please could you address Opus's suggestions on test factorization.
```cpp
// The stack index `n` is clamped on read so that overflow (push past capacity) does not let subsequent pops and
// top-accesses underflow it and index far out of bounds. The corresponding stack_push sets
// `runtime->adstack_overflow_flag` and skips the increment instead of trapping, so the host-side launcher
// surfaces the failure as a Python exception rather than killing the process via __builtin_trap. When n == 0
```
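The clamp-on-read and overflow-flag behaviour quoted above can be modelled in a few lines. This is a minimal Python sketch, not the C++ runtime: the class name, the `capacity` parameter, and the Python-level `overflow_flag` attribute are illustrative stand-ins for the real `max_num_elements` and `runtime->adstack_overflow_flag`.

```python
class AdStack:
    """Toy model of the clamped adstack semantics described in the comment."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.n = 0
        self.overflow_flag = False

    def push(self, value):
        # Overflow sets the flag and skips the increment instead of trapping.
        if self.n + 1 > self.capacity:
            self.overflow_flag = True
            return
        self.data.append(value)
        self.n += 1

    def _clamped_n(self):
        # Reads clamp n so later pops / top-accesses cannot underflow it
        # and index far out of bounds.
        return max(min(self.n, self.capacity), 0)

    def pop(self):
        n = self._clamped_n()
        if n == 0:
            return None  # empty stack: nothing to pop
        self.n = n - 1
        return self.data.pop()

    def top(self):
        n = self._clamped_n()
        return self.data[n - 1] if n > 0 else None
```

A push past capacity raises the flag but leaves the stack contents intact, so subsequent pops still return the values that were successfully pushed.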
|
Asking Opus about any runtime impact:

Hot-path branches in stack operations. stack_push, stack_pop, and stack_top_primal are called on every iteration of every AD loop that uses the adstack. This PR adds a new bounds check in stack_push, `if (n + 1 > max_num_elements)`, before every push, plus an extra `LLVMRuntime *runtime` pointer argument threaded through every call site. In normal (non-overflowing) operation the branches are always not-taken, so CPU branch prediction should handle them well. On GPU, all threads in a warp would take the same path.

synchronize() poll. Every qd.sync() now calls check_adstack_overflow(), which makes a JIT call into the runtime to read and reset the flag, plus a result buffer fetch.

In practice, the cost is probably negligible — a couple of predictable branches per stack operation. But if someone wanted to benchmark it, the place to look is the stack push/pop overhead in a tight AD loop.
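If someone did want to benchmark it, a toy harness along these lines would do. Everything here is a Python stand-in for the C++ push (function names assumed, timings only indicative); the real branch costs a few cycles at most.

```python
import timeit

def push_unguarded(stack, n, cap, value):
    # Pre-PR behaviour: push with no bounds check.
    stack.append(value)
    return n + 1

def push_guarded(stack, n, cap, value):
    # Post-PR behaviour: the new bounds check; overflow skips the
    # increment (the real code also sets runtime->adstack_overflow_flag).
    if n + 1 > cap:
        return n
    stack.append(value)
    return n + 1

def run(push, iters=10_000):
    stack, n = [], 0
    for i in range(iters):
        n = push(stack, n, iters, i)
    return n

t_plain = timeit.timeit(lambda: run(push_unguarded), number=20)
t_guard = timeit.timeit(lambda: run(push_guarded), number=20)
```

In the non-overflowing regime both variants produce identical stacks; the delta between `t_guard` and `t_plain` is the (tiny, always-not-taken) branch cost this paragraph is talking about.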
|
Opus response for test coverage: Code changes in this PR:

Untested functionality implied by the design:

Please could you add tests for untested functionality implied by the design:

Oh, also please add tests for the rows in the table above with 'no', i.e.:
LGTM — all review issues addressed, including the CPU-only arch guard on the 256 KB adstack budget check.
Extended reasoning...
Overview
The PR adds adstack overflow detection to the LLVM autodiff runtime: stack_push now sets runtime->adstack_overflow_flag via a relaxed atomic store on overflow, stack_pop/stack_top_primal have underflow/clamping guards, LlvmRuntimeExecutor::check_adstack_overflow() polls the flag on every synchronize(), and LlvmProgramImpl::pre_finalize() suppresses the poll during teardown. Additionally AdStackAllocaStmt::size_in_bytes() is corrected from sizeof(int32) to sizeof(int64) to match the runtime layout. 11 files touched across codegen, IR, runtime, and tests.
Security risks
No security-sensitive surfaces (no auth, crypto, or permission code). The relaxed-atomic fix for the data race in stack_push is a correctness improvement.
Level of scrutiny
This is production-critical runtime code (autodiff correctness on all LLVM backends), but the changes are well-bounded safety additions rather than logic rewrites. Every bug raised across multiple review rounds was addressed:
- Comment inaccuracies (error_code vs adstack_overflow_flag) — fixed
- finalizing_ timing (pre_finalize() before teardown syncs) — fixed
- sizeof header mismatch — fixed
- Atomic write for multi-threaded overflow — fixed
- CPU-only arch guard on the 256 KB budget check — fixed (fixup commit 851d8fd)
- test_stack heap leak — fixed
- Test ad_stack_size pins and extension guards — fixed
Other factors
Test coverage is comprehensive: overflow raises, flag reset, teardown safety, multi-threaded overflow, and large-capacity resolution are all exercised. The bug hunting system found no new issues in the final state. The unresolved inline thread on the GPU budget check is moot because the arch guard is present in the submitted diff.
Opus says this PR is three ~independent fixes. Please could we split into three PRs? These three fixes are orthogonal and could be PR'd separately?

Yes — all three are independently mergeable, with one soft ordering preference. Sketch:

Fix 3: AdStackAllocaStmt::size_in_bytes header size
• Scope: 1 file, ~4 lines (quadrants/ir/statements.h).

Fix 1: runtime overflow → Python exception + teardown safety
• Scope: ~8 files (runtime.cpp, internal_functions.h, llvm_runtime_executor.{cpp,h}, llvm_program.h, program.cpp, program_impl.h, codegen_llvm.cpp for the stack_push signature).

Fix 2: codegen per-task stack-budget guard
• Scope: 2 files (codegen_llvm.cpp, codegen_llvm.h).

Suggested order

Why they're getting reviewed together anyway
They share a theme — "stop silent wrong gradients in the adstack path" — and they were probably discovered together while chasing the same symptom. That's normal.

But it sounds like we need the test from fix 1 for fix 3?

Yes — kind of. Let me be precise about what "needs" means here.

Fix 3 is provably correct by inspection
The runtime reads/writes the header through *(u64 *)stack at every site (stack_init, stack_push, stack_top_primal). The alloca was sized as if the header were int32. Changing it

But it has no executable validation in isolation
Without fix 1, the only test exercising the adstack at the C++ level is the old test_stack, which does 4 pushes into a max_size=16 stack — nowhere near the corrupted region.

That's why I said "obviously correct by inspection but with no executable proof" — and why fix 1's rewritten test_stack (which pushes to max_size and reads *(u64 *)stack) is the

What this means for splitting
Three options, in increasing test discipline:

If splitting, option 2 is the cleanest — a one-liner fix plus a one-line assertion is still a tiny PR, and it gives fix 3 its own regression pin instead of relying on fix 1's

The general principle: a fix that's "obviously right" but covered only by an unrelated test is one accidental refactor away from being silently undone. A 5-line dedicated test is
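For illustration, a hedged sketch of what fix 3's dedicated regression pin could look like, written as a Python model rather than the real C++ test. The layout formula (8-byte int64 header plus a primal and an adjoint slot per entry) is inferred from the PR description's own arithmetic (8 + 4096 * 16 bytes for an f64 stack); the function name merely mirrors `AdStackAllocaStmt::size_in_bytes()`.

```python
# The bug fix 3 repairs: the alloca was sized with a 4-byte (int32)
# header while the runtime reads/writes an 8-byte (int64) header.
HEADER_BYTES = 8  # sizeof(int64); was wrongly sizeof(int32)

def size_in_bytes(capacity, element_bytes):
    """Model of the corrected adstack alloca size:
    64-bit header + (primal + adjoint) slot per entry."""
    return HEADER_BYTES + capacity * 2 * element_bytes
```

A dedicated pin then asserts concrete sizes so an accidental revert to the int32 header (which would shrink every result by 4 bytes) fails loudly instead of silently corrupting the last entry.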
So let's split into the following PRs please:
Additional findings (outside current diff — PR may have been updated during review):
- 🔴 quadrants/runtime/llvm/runtime_module/runtime.cpp:1107-1112 — The PR deletes all eight shuffle_down runtime helpers (amdgpu_shuffle_down_i32/f32/i64/f64 and cuda_shuffle_down_i32/f32/i64/f64) from runtime.cpp as incidental cleanup, but codegen_cuda.cpp (~lines 746-752) and codegen_amdgpu.cpp (~lines 449-455) still emit JIT calls to these exact symbol names when lowering subgroup shuffle_down operations. Any CUDA or AMDGPU kernel that uses qd.subgroup.shuffle_down() will fail at JIT link time with an undefined symbol error. Restore the eight deleted functions or update the codegen sites to inline the shuffle_down logic.

Extended reasoning...
What the bug is and how it manifests
The PR diff shows two deletion hunks in runtime.cpp: one removing amdgpu_shuffle_down_i32/f32/i64/f64 (around line 1107) and one removing cuda_shuffle_down_i32/f32/i64/f64 (around line 1143). These are described as incidental cleanup but they are not dead code — they are JIT-linked runtime symbols that GPU codegen actively resolves at kernel compilation time.
The specific code path that triggers it
codegen_cuda.cpp contains an emit_cuda_shuffle_down() function (approximately lines 744–755) that calls:
```cpp
call("cuda_shuffle_down_i32", offset, value)
call("cuda_shuffle_down_f32", offset, value)
call("cuda_shuffle_down_f64", offset, value)
call("cuda_shuffle_down_i64", offset, value)
```

codegen_amdgpu.cpp contains an emit_amdgpu_shuffle_down() function (approximately lines 447–458) that calls:

```cpp
call("amdgpu_shuffle_down_i32", offset, value)
call("amdgpu_shuffle_down_f32", offset, value)
call("amdgpu_shuffle_down_f64", offset, value)
call("amdgpu_shuffle_down_i64", offset, value)
```

Neither codegen file is modified by this PR, so these eight call() sites remain unchanged and still reference the deleted symbols.
Why existing code does not prevent it
The call() invocations in LLVM codegen resolve symbol names from the JIT-linked runtime bitcode module at kernel compilation time. There is no compile-time check that the symbol exists in the module — the linker error only surfaces when a CUDA or AMDGPU kernel that exercises a subgroup.shuffle_down() operation is compiled. The PR's own test suite does not include CUDA/AMDGPU subgroup tests (the tests added are all CPU adstack tests), so the breakage escapes CI.
What the impact would be
Any user calling qd.subgroup.shuffle_down() on a CUDA or AMDGPU backend will receive a JIT linker error ('undefined symbol: cuda_shuffle_down_i32' or equivalent) the first time the kernel is compiled, turning previously working code into a hard failure. This is a complete regression for GPU subgroup reduction patterns.
How to fix it
Either (a) restore the eight deleted functions — they were not related to the adstack work and their deletion was purely cosmetic — or (b) update codegen_cuda.cpp and codegen_amdgpu.cpp to inline the shuffle_down logic directly instead of calling the runtime helpers.
Step-by-step proof
- PR diff, runtime.cpp hunk 1 (line ~1107): deletes amdgpu_shuffle_down_i32, amdgpu_shuffle_down_f32, amdgpu_shuffle_down_i64, amdgpu_shuffle_down_f64.
- PR diff, runtime.cpp hunk 2 (line ~1143): deletes cuda_shuffle_down_i32, cuda_shuffle_down_f32, cuda_shuffle_down_i64, cuda_shuffle_down_f64.
- grep for 'shuffle_down' in the post-PR runtime.cpp returns zero matches — the functions are gone.
- codegen_cuda.cpp (not modified by PR): emit_cuda_shuffle_down() at lines 746/748/750/752 calls call("cuda_shuffle_down_i32", ...), call("cuda_shuffle_down_f32", ...), call("cuda_shuffle_down_f64", ...), call("cuda_shuffle_down_i64", ...).
- codegen_amdgpu.cpp (not modified by PR): emit_amdgpu_shuffle_down() at lines 449/451/453/455 calls the amdgpu variants.
- When any CUDA/AMDGPU kernel containing a subgroup.shuffle_down is compiled, the JIT linker cannot resolve the missing symbols and raises an undefined symbol error.
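One way to pin this class of regression without GPU hardware in CI is a source-level check that the JIT-linked symbol names still exist in runtime.cpp. A hedged sketch — the helper and the check itself are illustrative, not part of the PR:

```python
import re

# The eight symbol names the CUDA/AMDGPU codegen resolves by name
# at JIT link time (per the call() sites quoted above).
REQUIRED_SYMBOLS = [
    f"{arch}_shuffle_down_{ty}"
    for arch in ("cuda", "amdgpu")
    for ty in ("i32", "f32", "i64", "f64")
]

def missing_symbols(runtime_source: str):
    """Return every required symbol with no definition-like occurrence
    (name followed by an opening paren) in the given source text."""
    return [
        s for s in REQUIRED_SYMBOLS
        if not re.search(rf"\b{s}\s*\(", runtime_source)
    ]
```

A test would read the real runtime.cpp and assert `missing_symbols(...) == []`, turning a silent link-time breakage into a CI failure at the moment of deletion.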
LGTM — focused CPU-only budget guard with correct arch_is_cpu gate, properly reset per task, and a well-constructed child-process test.
Extended reasoning...
Overview
Touches three files: codegen_llvm.{cpp,h} and test_adstack.py. The change adds a single new field (ad_stack_fn_scope_bytes_) and a per-task accumulator reset in init_offloaded_task_function, then enforces a 256 KB budget via QD_ERROR_IF inside visit(AdStackAllocaStmt*). The test runs the budget-crossing kernel in a subprocess and asserts a non-zero exit code plus the expected guard message in output.
Security risks
No security-sensitive paths are touched. The change adds a compile-time guard that aborts the process on overflow — the same category of safety-check that already exists elsewhere in the codebase. No auth, crypto, or permissions code is involved.
Level of scrutiny
Low-to-moderate. The diff is small and self-contained. The two most important correctness properties — (a) the arch_is_cpu(current_arch()) gate preventing false rejections of valid GPU kernels, and (b) using current_task->name rather than the kernel-level kernel_name in the error message — are both correctly implemented in the current diff. The field-level comment in the header is already properly scoped with "On CPU arches only".
Other factors
All issues surfaced in the review thread (comment inaccuracies, teardown safety, flag-reset, ad_stack_size header mismatch, GPU guard) were addressed in earlier PRs or earlier commits of this PR. The one inline comment from today that lacks a formal "resolved" marker (codegen_llvm.h field comment) reflects code that is already correct in the current diff. The test is gated on both qd.extension.adstack and qd.extension.data64, and the subprocess approach correctly handles the QD_ERROR_IF → std::terminate termination path.
LGTM — the CPU-only budget guard is correctly implemented and the inline comments I raised today were false positives (the code already used current_task->name and the header comment was already CPU-scoped).
Extended reasoning...
Overview
PR #495 adds a CPU-only per-task adstack budget guard to TaskCodeGenLLVM::visit(AdStackAllocaStmt*). The change touches three files: codegen_llvm.cpp (guard logic + per-task accumulator reset), codegen_llvm.h (new ad_stack_fn_scope_bytes_ field), and tests/python/test_adstack.py (child-process test that verifies the guard fires). This is the focused 'codegen budget guard' slice of the autodiff safety series, matching the split that was requested.
Security risks
No security-sensitive code paths touched. The guard is a compile-time check that produces a loud abort rather than silent memory corruption; it cannot be reached from user-supplied data at runtime, only from kernel compilation. No auth, crypto, or permission logic is involved.
Level of scrutiny
Moderate. The guard is CPU-only (correctly gated on arch_is_cpu(current_arch()) in the .cpp; the header comment already notes the gate and the GPU-local-memory distinction), uses current_task->name for per-task identification in the error message, and resets the accumulator at the start of each offloaded task. Earlier rounds of review caught and fixed a real ordering bug (pre_finalize timing), a header-size mismatch (sizeof(int32) vs sizeof(int64)), a missing arch guard, and several comment inaccuracies — all addressed by the author. No bugs were found by the automated system on the current revision.
Other factors
My two inline comments posted today (2026-04-21 07:35 and 08:48) were false positives: the diff already uses current_task->name and the header already carries the 'On CPU arches only' scoping language. The subprocess.run timeout nit in the budget-guard test is a minor style concern; the teardown test's explicit comment explains the team's rationale for relying on pytest's per-test timeout instead, which applies equally here.
Guard against LLVM worker-thread stack overflow from large per-task adstack budget
TL;DR
The check is CPU-only by design: on CUDA / AMDGPU the same LLVM allocas lower to per-thread GPU local memory (a separate address space sized by the driver, not shared with the CPU call stack), so the 256 KB CPU-stack budget is not meaningful there. A non-gated version of the check would falsely reject valid GPU kernels with f64 loop-carried variables (4 adstacks at `ad_stack_size=4096` already cross 256 KB).

Why 256 KB
macOS secondary threads default to a ~512 KB stack. The worker-thread pool used by the LLVM JIT runs on those. A function-scope alloca for an adstack sits on the LLVM stack frame; if the sum of its sizes across a task crosses the thread's stack limit, the frame corrupts adjacent stack memory (typically the next page, sometimes guard pages). Downstream reverse-mode accumulators read zero, producing silently-wrong gradients with no crash.
256 KB is a conservative upper bound that leaves ~256 KB for other locals and nested call frames. Linux defaults to ~8 MB per secondary thread, so the limit is strictly conservative there — the codegen is more protective on Linux than strictly necessary, which is fine.
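The arithmetic behind these limits can be sanity-checked directly. A small sketch, assuming the layout stated elsewhere in this description (8-byte header, 16 bytes per f64 entry for primal plus adjoint):

```python
BUDGET_BYTES = 256 * 1024          # the per-task CPU budget this PR enforces
MACOS_THREAD_STACK = 512 * 1024    # default macOS secondary-thread stack

def adstack_bytes(capacity, elem_bytes=8):
    # 8-byte header + (primal + adjoint) slot per entry
    return 8 + capacity * 2 * elem_bytes

one_f64_stack = adstack_bytes(4096)  # 65,544 bytes, just over 64 KB
```

Four such stacks total 262,176 bytes, 32 bytes past the 262,144-byte budget, which is exactly why a non-gated check would reject valid GPU kernels at 4 adstacks; the budget is half the macOS secondary-thread stack, leaving ~256 KB of headroom for other locals and nested call frames.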
Why QD_ERROR_IF and not throw
`QD_ERROR_IF` logs a descriptive message and then calls `QD_UNREACHABLE` (a compiler-assisted abort). The codegen runs inside the LLVM compilation worker thread pool, where a C++ exception thrown from a worker doesn't cleanly propagate back to the Python-level caller (pybind11's exception translation only works at pybind binding boundaries, which are on the main thread). Throwing `QuadrantsRuntimeError` from here results in `std::terminate()` and a bare SIGABRT — a worse user experience than the guard's current behaviour, which at least logs the message before aborting.

The trade-off is that the abort shows up as a SIGABRT in the child process rather than a catchable Python exception. The testing approach below handles that.
Obsoleted by subsequent PRs
Autodiff 12 (LLVM heap-backed adstack) moves the storage off the worker thread stack entirely, making this guard unnecessary. When that PR lands, the guard — along with the `ad_stack_fn_scope_bytes_` accumulator field — is removed. This PR is still useful in the interim to make the silent-corruption failure loud, and to provide coverage for any future path that might re-introduce function-scope allocas for adstack storage.

Changes
quadrants/codegen/llvm/codegen_llvm.{h,cpp}

- `TaskCodeGenLLVM` gains an `ad_stack_fn_scope_bytes_` accumulator field, reset to 0 at the top of each offloaded task.
- `init_offloaded_task_function` resets the accumulator.
- `visit(AdStackAllocaStmt)` tallies per-alloca size and raises via `QD_ERROR_IF` when the per-task total crosses 256 KB on CPU.

Tests
test_adstack_codegen_budget_guard_runs_in_child_process

Runs the overflowing kernel in a child process (since `QD_ERROR_IF` aborts the process rather than raising a catchable Python exception) and asserts that the child exits with a non-zero code and that the guard message ("autodiff-stack budget exceeded") appears in the child's combined stdout/stderr.

Skip-gated on `qd.cpu` having the adstack and f64 extensions. Kernel shape: five `qd.f64` loop-carried variables applying `qd.sin` inside a dynamic `range(n_iter[None])` at `ad_stack_size=4096`. Each adstack is 8 + 4096 * 16 = 65,544 bytes ≈ 64 KB; five × 64 KB = 320 KB, comfortably past the 256 KB guard.

The "dynamic `n_iter` from a field" shape is load-bearing: with a Python-literal `range(3)` the compile-time trip count is known and the determine-ad-stack-size pass sizes each stack to 3 slots, not `default_ad_stack_size=4096`. Only a runtime-bound range defeats that pass and leaves each stack at the full `default_ad_stack_size`.

Side-effect audit

- The guard is gated on `arch_is_cpu(current_arch())`; it never fires on GPU.

Stack

Autodiff 9 of 13. Third commit of the "LLVM adstack safety" triplet split, and the top-most of that split (this PR is #495). Based on #535 (runtime overflow). Followed by #490 (SPIR-V adstack).
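The child-process pattern that test relies on can be sketched independently of qd itself. The helper name and the timeout value below are illustrative, not taken from the PR:

```python
import subprocess
import sys

def run_in_child(script: str):
    """Run a Python script in a child interpreter and return
    (exit code, combined stdout+stderr). Needed because QD_ERROR_IF
    aborts the process instead of raising a catchable exception."""
    proc = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True, text=True, timeout=120,
    )
    return proc.returncode, proc.stdout + proc.stderr

# A real test would pass the overflowing qd kernel source as `script`
# and assert a non-zero exit code plus the guard message:
#   rc, out = run_in_child(overflowing_kernel_source)
#   assert rc != 0 and "autodiff-stack budget exceeded" in out
```

Capturing stdout and stderr together mirrors the test's "combined stdout/stderr" assertion, since the abort path may log through either stream.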